ABSTRACT

The main thesis of this paper is that testing is now 
ready for a major effort to create a synthesis out of what has 
hitherto been a series of unrelated approaches to testing. The paper 
proposes the methods of Measurement, Evaluation and Assessment as 
three very different approaches to the problem of testing, and 
discusses their characteristics. It further suggests ways in which
some of the powerful aspects of each testing approach may be brought 
together into a more complex and useful way of handling test 
problems. Interactions between these approaches and consequences of 
the proposed synthetic theory are also discussed. (Author) 
The CENTER FOR THE STUDY OF EVALUATION OF INSTRUCTIONAL
PROGRAMS is engaged in research that will yield new ideas 
and new tools capable of analyzing and evaluating instruc- 
tion. Staff members are creating new ways to evaluate con- 
tent of curricula, methods of teaching and the multiple 
effects of both on students. The CENTER is unique because 
of its access to Southern California's elementary, second- 
ary and higher schools of diverse socio-economic levels 
and cultural backgrounds. Three major aspects of the pro- 
gram are 

Instructional Variables - Research in this area
will be concerned with identifying and evaluating 
the effects of instructional variables, and with 
the development of conceptual models, learning 
theory and theory of instruction. The research 
involves the experimental study of the effects of 
differences in instruction as they may interact 
with individual differences among students. 

Contextual Variables - Research in this area will
be concerned with measuring and evaluating differ-
ences in community and school environments and the
interactions of both with instructional programs.
It will also involve evaluating variations in stu-
dent and teacher characteristics and administrative
organization.

Criterion Measures - Research in this field is con-
cerned with creating a new conceptualization of eval-
uation of instruction and with developing new instru-
ments to evaluate knowledge acquired in school by
measuring observable changes in cognitive, affective
and physiological behavior. It will also involve
evaluating the cost-effectiveness of instructional
programs.
TOWARD A THEORY OF TESTING WHICH INCLUDES
MEASUREMENT - EVALUATION - ASSESSMENT

Benjamin S. Bloom 

In the 60 years since Binet first introduced his intelligence 
test, testing has become the pride and despair of psychology and 
education. Testing runs like a powerful minor theme through most 
of the research and the applied work in these fields. We take 
pride in testing because it is the one area which has shown the
clearest development and most widespread use in these two fields. Our
sophistication has grown rapidly in testing, and we know what we 
know and we know what we don't know in such clear ways that we can 
take advantage of the former while we attempt to reduce the latter. 

But, our despair arises from the overuse of testing, its 
tendency to dominate both psychology and education, and the nega- 
tive effect it sometimes has on human relations. Especially in 
education, testing is a two-edged sword which can do incalculable 
good as well as great harm to the individual. The recent reaction 
against intelligence testing in the large city schools, although 
emotional and in many ways misguided, brings home to us that chil- 
dren are judged in terms of test results and that faith in one 
child's ability to learn or rationalizations of a teacher's in- 
ability to teach another child are both related to test scores. 

To control the matriculation examinations of a country is to
control its educational system; to develop tests which are widely
used for selection and prediction purposes is to determine which
human qualities are prized and which are neglected; to develop
instruments which are frequently used to classify and describe
human beings is to alter human relations and to affect a person’s
view of himself.

It is no great exaggeration to compare the power of testing 
on human affairs with the power of atomic energy. Both are capa- 
ble of great positive benefit to all of mankind and both contain 
equally great potential for destroying mankind. If mankind is to 
survive, we must continually search for the former and seek ways 
of controlling or limiting the latter. What is needed in testing 
is a clearer understanding of what we have been doing and a new 
synthesis of our disparate methods and concepts in testing. Perhaps 
I can describe a few terms necessary for such a synthesis. 

What I propose to do is to describe briefly three very dif- 
ferent approaches to the field of testing, indicate why a new 
synthesis of these is in order at this time, and suggest some of 
the directions such a new synthesis could take. I do hope that 
I can impress you with the great need for such a synthesis even 
though you may be reluctant to accept my suggestions for the 
synthesis.



Three Approaches to Testing 

If we view testing as a systematic method of sampling one or 
more human characteristics and the representation of these results 
for an individual in the form of a descriptive statement or classi- 
fication, we can discern three very different approaches to this 
problem. For purposes of convenience, I will refer to these 
approaches as Measurement, Evaluation, and Assessment. I am sure
that some of you will use other terms to describe these approaches. 
However, the problem is not the accuracy or meaningfulness of the 
terms, but how to discern the very basic differences underlying 
these approaches and the contrast among them in the assumptions 
they make about the world, about man, and about the nature of 
evidence.

Measurement*

Perhaps the first approach (historically) to testing 
human characteristics began with the work of Galton and Binet. 
Although they differed in many respects, what they had in common 
was the development of standard stimuli, tasks, and questions.
The subject’s responses to these standard situations were to be
appraised in terms of speed and/or accuracy--where accuracy was
to be judged in a standard way--by all trained testers. The
results for each examinee were translated into some quantitative 
form (I.Q., raw score, time of response, etc.), which was then 
given further meaning by relating it to the normative data for a 
given sample of individuals. 
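
As a minimal sketch of this norm-referenced step, the following Python fragment converts a raw score into a standard score and a percentile rank relative to a norm sample. The function names and the sample scores are hypothetical illustrations, not data or procedures taken from any of the tests discussed here.

```python
from statistics import mean, pstdev


def z_score(raw, norm_sample):
    """Standard score of `raw` relative to the norm sample."""
    return (raw - mean(norm_sample)) / pstdev(norm_sample)


def percentile_rank(raw, norm_sample):
    """Percent of the norm sample scoring below `raw` (ties count half)."""
    below = sum(1 for s in norm_sample if s < raw)
    ties = sum(1 for s in norm_sample if s == raw)
    return 100.0 * (below + 0.5 * ties) / len(norm_sample)


# Hypothetical norm sample of raw scores (invented for illustration).
norm_sample = [10, 12, 14, 16, 18]

# The same raw score acquires meaning only against the norm sample.
examinee_z = z_score(18, norm_sample)
examinee_pr = percentile_rank(18, norm_sample)
```

A different norm sample would give the same raw score a different standing, which is why the choice of sample matters to this approach.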

Since testing under this approach usually involves a sample
of the individual’s responses at a particular point in time (and
at a particular point in the individual’s career), there has been
a great concern for determining the error of the sample by means
of methods for estimating the reliability and objectivity of the
score assigned the examinee. The meaningfulness of the results
has been usually determined by some form of concurrent or pre-
dictive validity. That is, the validity of a measurement instru-
ment is usually approached in terms of its relation with another
measurement or appraisal.

*Some illustrations of the measurement approach are Terman and
Merrill (1959), Thurstone (1938), Strong (1943), Gulliksen (1950),
and Hathaway and McKinley (1951).
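
The relation of one measurement to another, whether a retest for reliability or a criterion for validity, is conventionally summarized as a correlation. The sketch below assumes a Pearson product-moment coefficient, which the text itself does not name, and uses invented scores; `pearson_r` and the variable names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for five examinees.
test_scores = [1, 2, 3, 4, 5]
retest_scores = [2, 4, 6, 8, 10]   # same ordering on a second occasion
criterion_scores = [2, 1, 4, 3, 5]  # a separate appraisal of the same people

reliability = pearson_r(test_scores, retest_scores)   # about 1.0
validity = pearson_r(test_scores, criterion_scores)   # about 0.8
```

If the criterion is collected at the same time as the test, the coefficient would be read as concurrent validity; if collected later, as predictive validity.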

Although the measurement view has not entirely ignored the 
environment in which the individual has developed, the environ- 
ment is generally ignored at the time of making the measurements. 
What a measurement specialist does is to attempt to take into 
consideration the environment as an error term, since he assumes 
that his measurements are accurate to the extent to which the 
examinees have had "equal opportunity" to develop the character-
istics being sampled. However, the measurement approach does
seek characteristics which are "in the individual." That is, the
individual is the possessor of I.Q., ability, creativity, etc.,
and he is to be measured to determine the amount of each character-
istic he possesses.

In measurement there is an assumption that the same character-
istics (I.Q., memory, etc.) can be measured in all men--no matter
what their background--and that the characteristics can be measured
in an analogous way at different times and at different places.
I.Q. is very similar in 1967 and in 1917 in the United States,
France, or India.

The use of the tests under the measurement view is largely 
for classification, prediction, and experimentation. The major
quest in measurement is for a small number of dimensions or mea- 
sures which will completely account for the variance of a criter- 
ion when put together in some additive or summative combination. 
The problems which are most alive in measurement today are the
search for better units (hopefully with properties akin to phys-
ical measurement units), the search for a parsimonious measure-
ment system which will account for the variance of a large number
of variables or measures, and the search for improved methods of 
sampling characteristics and individuals. 

The great power of measurement lies in its efficiency.
Given a dimension or a criterion, psychometric methods yield
parsimonious procedures for measuring it and for describing it
in terms of a small number of dimensions.

Evaluation*

Starting in the 1930's, Ralph Tyler (1934) proposed that
educational testing be concerned with the changes in students 
produced by educational means. He used the term evaluation to 
refer to a set of procedures for appraising changes in students. 

The stress on appraisal of change meant that, theoretically 
at least, testing had to be done at two or more points in time on 
each individual to determine the extent of change. Since it was 
necessary to limit the types of changes to be tested, Tyler sug- 
gested that tests be constructed to sample the changes in students 
specified by the objectives of instruction--that is, the changes
which were intended by the instructors, instruction, or the 
curriculum. 

While the evaluation approach is concerned with the reliabil-
ity, objectivity, and efficiency of the tests used, these are
secondary questions. Its primary concern is with the content
validity of the instruments developed. That is, there must be an
adequate definition of the objectives or characteristics to be
appraised and a search for ways of testing these characteristics
which appropriate experts can agree are sampling the desired
behaviors. Once it has been possible to construct a valid test
of the objective, it is possible to use concurrent validity to
determine more efficient and parsimonious instruments to test the
same objective (using the valid test as the criterion). Reliability
and objectivity can then be improved until they reach the desired
standard.

*Some illustrations of the evaluation approach are Smith and Tyler
(1942), Furst (1958), Bloom (1956), Dressel and Mayhew (1954).

It should be pointed out that evaluation is concerned with 
securing evidence on the attainment of specific objectives of 
instruction. As the objectives become more varied in nature, it 
is to be expected that a greater variety of types of evidence may 
be appropriate. Thus evaluation evidence may include products 
developed by students, processes in which they engage, and behav- 
iors they manifest in a great variety of situations. The evidence 
may be qualitative as well as quantitative. This is a far cry from 
the standard stimulus-standard response evidence gathering in
measurement.

Evaluation follows the objectives of instruction. Therefore, 
to the extent that objectives differ from teacher to teacher, 
school to school, or curriculum to curriculum, it is necessary to 
devise evaluation procedures appropriate to the specific situations. 
A single standard test may not be equally appropriate to all 
situations.

Although evaluation is primarily concerned with changes in 
individuals, it may be applied to evaluating the effects of a 
curriculum, a course, a teacher, a method of instruction, etc.
For such problems where the concern may be with group changes
rather than individual changes, it is possible to utilize student-
test sampling methods which will yield evidence about the group
rather than the individuals.

Since evaluation attempts to appraise the changes in students,
it is necessary to find methods to judge the extent to which the
objectives have been met. The standard against which the evidence
is appraised may be the usual type of normative data on particular
samples; it may also include absolute criterion-referenced stan-
dards; and it may even include the student as his own standard--
for example, the change in the student over one period of time as
contrasted with the change in that student over another period of
time.
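
The three kinds of standard named above can be contrasted in a short Python sketch. The function names, the cutoff, and all scores are hypothetical and chosen only to illustrate the distinction; they are not procedures prescribed by the evaluation literature cited.

```python
def norm_referenced(score, norm_sample):
    """Proportion of a norm sample scoring at or below `score`."""
    return sum(1 for s in norm_sample if s <= score) / len(norm_sample)


def criterion_referenced(score, cutoff):
    """True if the score meets an absolute standard, regardless of other students."""
    return score >= cutoff


def self_referenced(earlier_period, later_period):
    """Compare a student's gain in a later period with his gain in an earlier one.

    Each period is a (start_score, end_score) pair; a positive result means
    the student changed more in the later period.
    """
    earlier_gain = earlier_period[1] - earlier_period[0]
    later_gain = later_period[1] - later_period[0]
    return later_gain - earlier_gain


# Invented data: one student's score of 15 judged three ways.
standing = norm_referenced(15, [10, 12, 15, 18, 20])  # relative to other students
mastery = criterion_referenced(15, cutoff=16)         # relative to a fixed objective
growth = self_referenced((10, 12), (12, 17))          # relative to the student himself
```

The same score can look adequate under one standard and inadequate under another, which is why the choice of standard has to follow from the objectives being evaluated.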

Evaluation need not be confined to a summative combination 
of items or scores. Various patterns of responses may be inter- 
preted to determine the types of changes taking place in the stu- 
dent, the types of errors he makes, and the reasons underlying 
his attainment or lack of attainment of the objectives specified 
for instruction. 

In measurement, the environment is a source of error in the 
scores or attainments of the individuals being measured. In 
evaluation, the environment (instruction, class, school, etc.) is 
assumed to be the major source of the changes. Ideally, evalua- 
tion is as much concerned with the characteristics of the environ- 
ment which produces the change as it is with the appraisal of the 
changes in the individuals who are interacting with the environment. 
In practice, the evaluator frequently limits himself to a descrip-
tion of the environment while he appraises in detail the changes
taking place in the individuals. 

One major use of evaluation has been to classify individuals 
for purposes of grading, certification, and placement or promotion. 
Perhaps of equal importance is the use of evaluation to determine 
the effectiveness of a method of instruction, a specific course, 
curriculum or program, or a specific instructor. Evaluation may 
be used in education experimentation, and it can be used as a 
method for maintaining quality control in education. 

Perhaps a major difference between measurement and evaluation 
is the recognition (and utilization) of the effects of testing on 
the persons involved. Characteristically, measurement strives to 
limit or control the effects of testing on the student performance. 
Measurement’s concern with "equal opportunity" usually is directed 
to limiting or equalizing the opportunity students have to learn 
about the sample of problems on which they will be tested. In 
contrast, in evaluation there is a more explicit concern with stu- 
dent growth or change and with the utilization of the effects of 
testing to promote such change. Thus, it is recognized that both 
teachers and students can be motivated to teach and learn by the 
nature of the tests they anticipate will be used--this effect can 
be maximized or minimized as desired. Furthermore, the transla- 
tion of objectives into testing situations has the effect of giving 
operational definition to the desired characteristics--and, in
turn, such operational definition can focus and intensify the 
development by teacher and students of these desired characteristics. 